This is a Jupyter Notebook¶

A Jupyter Notebook is a data-science environment that lets you combine:

  1. Narrative: The text describing your analysis
  2. Code: The program that does the analysis
  3. Results: The output of the program





Example¶

Note: In this lecture there is a lot of code. You are not expected to know any of this yet. This is just a preview of the things you will see in the next few weeks.

We can use the tools of data science to study text. For example, here we will do some basic analysis of "Adventures of Huckleberry Finn" and "Little Women". The first step is getting the data. The following is a tiny program to download text from the web.

In [1]:
# A tiny program to download text from the web.
def read_url(url):
    from urllib.request import urlopen
    import re
    # Download the page, decode it to text, and collapse each run of whitespace into one space.
    return re.sub(r'\s+', ' ', urlopen(url).read().decode())
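
To see what the `re.sub` call in `read_url` does without touching the network, here is a minimal sketch that applies the same substitution to a short string. The helper name `collapse_whitespace` and the sample text are made up for illustration.

```python
import re

def collapse_whitespace(text):
    # Same substitution used in read_url: every run of spaces, tabs,
    # or newlines becomes a single space.
    return re.sub(r'\s+', ' ', text)

sample = "CHAPTER I.\n\nYOU don't know\tabout   me"  # illustrative text only
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Collapsing whitespace this way puts the whole book on one line, which makes later steps such as splitting on 'CHAPTER ' independent of where the original file broke its lines.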

Here we download the books from the data8 textbook website.

In [2]:
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]
In [3]:
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
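
The slice after `split` matters because the first piece is whatever comes before the first occurrence of the separator. A small sketch on a made-up string (not the real book text):

```python
# Illustrative text only: a title followed by three very short "chapters".
book = "A SAMPLE BOOK CHAPTER I. First. CHAPTER II. Second. CHAPTER III. Third."

pieces = book.split('CHAPTER ')
# pieces[0] is the front matter before the first 'CHAPTER ', so skip it.
chapters = pieces[1:]

print(pieces[0])
print(chapters)
```

The larger offset used for Huckleberry Finn (`[44:]`) presumably skips additional occurrences of the separator that appear before the story begins, such as a table of contents. That is an inference from the code, not something the notebook states.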

Let's look at the text from the first chapter of Huckleberry Finn:
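
One way to look at a chapter without printing thousands of characters is to slice the string. The sketch below uses a short stand-in list, since the real `huck_finn_chapters` requires the download above; the names `chapters`, `first_chapter`, and `preview` and the slice length are illustrative choices.

```python
# Stand-in for huck_finn_chapters; with the real data, use huck_finn_chapters[0].
chapters = ["I. YOU don't know about me without you have read a book",
            "II. WE went tiptoeing along a path amongst the trees"]

first_chapter = chapters[0]
preview = first_chapter[:25]   # the first 25 characters
print(preview + ' ...')
```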

Working with Tables¶

In this class you will use the Berkeley datascience library to manipulate and visualize data.

In [4]:
from datascience import *
In [5]:
Table().with_column('Chapters', huck_finn_chapters)
Out[5]:
Chapters
I. YOU don't know about me without you have read a book ...
II. WE went tiptoeing along a path amongst the trees bac ...
III. WELL, I got a good going-over in the morning from o ...
IV. WELL, three or four months run along, and it was wel ...
V. I had shut the door to. Then I turned around and ther ...
VI. WELL, pretty soon the old man was up and around agai ...
VII. "GIT up! What you 'bout?" I opened my eyes and look ...
VIII. THE sun was up so high when I waked that I judged ...
IX. I wanted to go and look at a place right about the m ...
X. AFTER breakfast I wanted to talk about the dead man a ...

... (33 rows omitted)

In [6]:
import numpy as np
In [7]:
np.char.count(huck_finn_chapters, 'Tom')
Out[7]:
array([ 6, 24,  5,  0,  0,  0,  2,  2,  0,  0,  2,  3,  1,  0,  0,  0,  3,
        5,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  4, 19, 15,
       14, 18,  9, 32, 11, 11,  8, 30,  6])
In [8]:
np.char.count(huck_finn_chapters, 'Jim')
Out[8]:
array([ 0, 16,  0,  8,  0,  0,  0, 22, 11, 19,  4, 20,  9,  6, 16, 28,  0,
       10, 13, 18,  1,  0,  9,  5,  0,  0,  0,  1,  3,  5, 17,  0,  5, 17,
       18, 23,  4, 27, 10, 13,  0, 12,  6])
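
`np.char.count` counts occurrences of a substring in each element of an array of strings. A minimal sketch on made-up strings shows two things worth remembering: the match is case-sensitive, and it counts substrings rather than whole words.

```python
import numpy as np

# Illustrative strings only, standing in for chapters.
texts = np.array(['Tom saw Tom', 'tom and Jim', 'Tomorrow'])

tom_counts = np.char.count(texts, 'Tom')
print(tom_counts)   # one count per string
```

Here the lowercase 'tom' is not counted, while 'Tomorrow' is, so the counts in the cells above should be read as approximate mentions of each name.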
In [9]:
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts
Out[9]:
Tom Jim Huck
6 0 3
24 16 2
5 0 2
0 8 1
0 0 0
0 0 2
2 0 0
2 22 5
0 11 1
0 19 0

... (33 rows omitted)

We Will Learn to Visualize Data¶

In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

Plot the cumulative counts: how many times each name appears in Chapter 1, in Chapters 1 and 2 combined, and so on.
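
A cumulative count is a running total. As a small sketch, here is `np.cumsum` applied to the first four per-chapter counts of 'Tom' shown earlier (6, 24, 5, 0):

```python
import numpy as np

tom_first_four = np.array([6, 24, 5, 0])   # first four values from the 'Tom' counts
running_total = np.cumsum(tom_first_four)
print(running_total)
```

Each entry is the sum of all counts up to and including that chapter, so a cumulative line can stay flat or rise but never fall.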

In [11]:
# Running total of each name's count, chapter by chapter. np.cumsum is applied
# to each column explicitly because the implicit counts.cumsum() lookup is deprecated.
cum_counts = Table().with_columns(
    'Tom', np.cumsum(counts.column('Tom')),
    'Jim', np.cumsum(counts.column('Jim')),
    'Huck', np.cumsum(counts.column('Huck')),
    'Chapter', np.arange(1, 44, 1),
)
cum_counts.plot(column_for_xticks="Chapter")
plt.title('Cumulative Number of Times Name Appears');
In [12]:
# The chapters of Little Women
Table().with_column('Chapters', little_women_chapters)
Out[12]:
Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho ...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr ...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ...
FOUR BURDENS "Oh, dear, how hard it does seem to take up ...
FIVE BEING NEIGHBORLY "What in the world are you going t ...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect ...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as ...
NINE MEG GOES TO VANITY FAIR "I do think it was the most ...
TEN THE P.C. AND P.O. As spring came on, a new set of am ...

... (37 rows omitted)

In [13]:
# Counts of names in the chapters of Little Women
names = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
mentions = {name: np.char.count(little_women_chapters, name) for name in names}
counts = Table().with_columns([
        'Amy', mentions['Amy'],
        'Beth', mentions['Beth'],
        'Jo', mentions['Jo'],
        'Laurie', mentions['Laurie'],
        'Meg', mentions['Meg']
    ])
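
Because `np.char.count` matches substrings, a short name such as 'Jo' is also counted inside longer words. The sketch below contrasts that with a whole-word count built from a regular expression. `count_whole_word` is a hypothetical helper, not something the notebook or the datascience library provides, and the sample sentences are made up.

```python
import re
import numpy as np

def count_whole_word(texts, name):
    # \b marks a word boundary, so 'Jo' will not match inside 'John'.
    pattern = re.compile(r'\b' + re.escape(name) + r'\b')
    return np.array([len(pattern.findall(t)) for t in texts])

samples = ['Jo wrote to John.', 'Jo and Jo again', 'Joy']   # illustrative only

substring_counts = np.char.count(samples, 'Jo')
whole_word_counts = count_whole_word(samples, 'Jo')
print(substring_counts, whole_word_counts)
```

The notebook uses the simpler substring count, which is fine for a first look; the whole-word version is only one possible refinement.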
In [14]:
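# Plot the cumulative counts
Table.static_plots()
# np.cumsum is applied per column; the implicit counts.cumsum() lookup is deprecated.
cum_counts = Table().with_column('Chapter', np.arange(1, 48, 1))
for name in names:
    cum_counts = cum_counts.with_column(name, np.cumsum(counts.column(name)))
cum_counts.plot(column_for_xticks='Chapter')
plt.title('Cumulative Number of Times Name Appears');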

We can also draw the same plot with interactive tools.

In [15]:
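# Plot the cumulative counts again, this time as an interactive chart
Table.interactive_plots()
cum_counts = Table().with_column('Chapter', np.arange(1, 48, 1))
for name in names:
    cum_counts = cum_counts.with_column(name, np.cumsum(counts.column(name)))
cum_counts.plot(column_for_xticks='Chapter')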

Examining Length¶

In [16]:
len('Data 8')
Out[16]:
6
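
`len` on a string counts every character, including spaces and punctuation; it does not count words. A quick sketch of the difference:

```python
title = 'Data 8'
n_chars = len(title)           # 'D', 'a', 't', 'a', ' ', '8'
n_words = len(title.split())   # split() breaks the string on whitespace
print(n_chars, n_words)
```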
In [17]:
len(read_url(huck_finn_url))
Out[17]:
588035
In [18]:
# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.

length_hf = Table().with_columns([
        'Length', [len(s) for s in huck_finn_chapters],
        'Periods', np.char.count(huck_finn_chapters, '.')
    ])
length_lw = Table().with_columns([
        'Length', [len(s) for s in little_women_chapters],
        'Periods', np.char.count(little_women_chapters, '.')
    ])
In [19]:
# The counts for Huckleberry Finn
length_hf
Out[19]:
Length Periods
7026 66
11982 117
8529 72
6799 84
8166 91
14550 125
13218 127
22208 249
8081 71
7036 70

... (33 rows omitted)

In [20]:
# The counts for Little Women
length_lw
Out[20]:
Length Periods
21759 189
22148 188
20558 231
25526 195
23395 255
14622 140
14431 131
22476 214
33767 337
18508 185

... (37 rows omitted)
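
One quantity these two tables make easy to compare is the number of characters per period, a rough proxy for sentence length. The sketch below uses only the ten rows displayed above for each book, typed in by hand, so it is an illustration rather than a result for the full books. The variable names are chosen here for the example.

```python
import numpy as np

# First ten rows of each table, copied from the output above.
hf_length  = np.array([7026, 11982, 8529, 6799, 8166, 14550, 13218, 22208, 8081, 7036])
hf_periods = np.array([66, 117, 72, 84, 91, 125, 127, 249, 71, 70])
lw_length  = np.array([21759, 22148, 20558, 25526, 23395, 14622, 14431, 22476, 33767, 18508])
lw_periods = np.array([189, 188, 231, 195, 255, 140, 131, 214, 337, 185])

# Characters per period in each chapter (element-wise division).
hf_ratio = hf_length / hf_periods
lw_ratio = lw_length / lw_periods

print(np.round(hf_ratio, 1))
print(np.round(lw_ratio, 1))
```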

In [21]:
Table.static_plots()
plt.figure(figsize=(10,10))
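# x-axis: periods per chapter, y-axis: characters per chapter (one dot per chapter)
plt.scatter(length_hf.column('Periods'), length_hf.column('Length'), color='darkblue')  # Huckleberry Finn
plt.scatter(length_lw.column('Periods'), length_lw.column('Length'), color='gold')  # Little Women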
plt.xlabel('Number of periods in chapter')
plt.ylabel('Number of characters in chapter');